[57] ABSTRACT

A compiler and method for optimizing a program based on branch probabilities, branch frequencies and function frequencies. A number of algorithms executed by the compiler determine statically from the program code the probabilities that branches within the program are taken and how often the branches are taken. With this information, the compiler arranges the object code in memory to improve execution of the program. The frequency of functions within the code may be determined from the branch probability and branch frequency information. The compiler uses the function frequency information to arrange the functions in a desirable order, such as storing function pairs with the highest global call frequencies on the same memory page. This minimizes the number of calls to functions that are stored on disk and thus improves the speed of execution of the program.
OPTIMIZING COMPILER WITH STATIC PREDICTION OF BRANCH PROBABILITY, BRANCH FREQUENCY AND FUNCTION FREQUENCY

FIELD OF THE INVENTION

This invention relates to computer language translators such as compilers that translate source code into object code. More particularly, this invention relates to compilers which optimize the object code they generate by applying code-improving transformations.

BACKGROUND OF THE INVENTION

A compiler is a computer program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target, or object, language). Common source languages are human readable languages such as FORTRAN, BASIC and C. Programs written in a source language are comprised of source code that consists of a series of instructions. Object languages are comprised of assembly language or machine language for a target machine such as an Intel microprocessor-based computer.

There are two parts to a compiler: analysis and synthesis. The analysis part breaks up the source program into constituent pieces and creates an intermediate code representation of the source program. The synthesis part constructs the equivalent object program from the intermediate code.

The analysis part of a compiler includes lexical, syntax and semantic analysis and intermediate code generation. This part is often referred to as the "front end" of a compiler because the part depends primarily on the source language and is largely independent of the target machine. Briefly, lexical analysis consists of reading the characters of the source program and grouping them into a stream of tokens. Each token represents a logically cohesive sequence of characters. Syntax analysis then groups, or parses, the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. Semantic analysis then checks the parsed source program for semantic errors. After performing syntax and semantic analysis, the compiler generates intermediate code from the parsed source program. The intermediate code is written in an intermediate language and consists of a series of instructions.

The synthesis part of a compiler typically includes code optimization and object code generation. This part is often referred to as the "back end" of the compiler because the code generated depends on the target machine language, not the source language. Code optimization attempts to improve the intermediate code for the program so that faster running machine code will result. Object code generation then generates object code from the improved intermediate code by, among other things, translating each intermediate code instruction into a sequence of machine instructions that perform the same task.

The use of instruction-level parallel processing in newer CPUs such as the Intel Pentium™ microprocessor has increased the need to optimize the order of instructions. With parallel processing, following instructions are executed in parallel with preceding instructions. However, if the preceding instructions include a branch, then execution of the following instructions is unnecessary if the preceding instructions branch away from the following instructions. The CPU instead must execute the instructions that the branch leads to. This circumstance is referred to as speculative instruction execution, since it is uncertain whether the parallel-executing instructions actually need to be executed.

To reduce this uncertainty, compilers attempt to assess a program's likely instruction path through the program's various branches. To select a profitable optimization, a compiler must first predict how often portions of a program execute. Once the more frequently executed portions are identified, any of a number of well known optimizations can be applied to these portions. These optimizations include arranging the sequence of object code so that the more frequently executed portions follow each other and can be executed in parallel.

Another reason for optimizing the order of instructions is to reduce cache misses in computers that utilize cache memory between the CPU and main memory. Instructions may be arranged so that those most likely to be executed sequentially are stored in the same cache line or block. Thus when a cache line is accessed for an instruction, the instructions most likely to follow are also immediately available to the CPU.

Determining the more frequently executed portions of a program is often done through a process known as profiling. Dynamic profiling consists of compiling and then executing a program to collect the execution frequencies of the program's portions. Most profiles result from dynamically counting events during a program's execution. Based on these counts, a compiler can identify the frequently executed code and optimize it with the benefit of this information.

However, dynamic profiling has a number of drawbacks. First, obtaining a profile of each program to be compiled requires compiling and executing the program twice, once to obtain the program's profile and once to optimize the code with the benefit of the profile information. Second, it is often impractical to profile real time and reactive systems. Third, optimization based on dynamic profiling is not automatic, but requires programmer intervention to provide the input and the profile to the optimizing process. End users are untrained in dynamic profiling and in using the profiling information to optimize programs they write.

An alternative is static profiling, in which a compiler estimates relative frequencies (not absolute counts) through a static analysis of the program's code. Static analysis relies upon heuristics (commonly observed program behaviors) for predicting what portions of a program most frequently execute. Heuristics are derived through observation of programs and typically are given as a probability, e.g., a chance that a branch of a certain type will be taken by a program. Since static analysis does not require executing the program to obtain the profile information, the drawbacks of dynamic profiling are avoided.

A prime example of present static profiling techniques is described by Thomas Ball and James Larus in a 1993 paper entitled "Branch Prediction for Free," which is hereby incorporated by reference. In their paper, Ball and Larus describe a number of heuristics they may apply to branches in a program's code to predict whether the branch will be taken. These heuristics include, for example, a prediction (yes or no) that a comparison of a pointer against a null in an If statement will fail. Based on these binary branch predictions, a compiler can estimate what portions of the program are most likely to be executed.

Typically, several heuristics apply to a branch. Ball and Larus predict a branch's outcome with the first heuristic, from a pre-computed, static priority ordering, that applies to a branch and disregard the other heuristics. This approach works well for branch prediction, which simply produces a
yes or no. However, it ignores valuable statistical information. Each heuristic is determined from empirical data, and associated with each heuristic is a statistical probability that the branch will be taken. It is this probability that provides the basis for the binary prediction. For example, the heuristic mentioned above empirically may have a 60% chance of being correct, and thus the prediction is that the branch will occur since the comparison fails most of the time. But this statistical information is not used beyond determining the prediction.

The primary drawback of static profiling techniques to date is their inaccuracy in predicting program behavior. The approach suggested by Ball and Larus as well as other static profiling approaches suggested by others are not as accurate as dynamic profiling.

An object of the invention, therefore, is to provide an improved static profiling method for determining frequently executed portions of a program. Another object of the invention is to provide an optimizing compiler that employs this method in optimizing the compilation of source code into object code.

SUMMARY OF THE INVENTION

In accordance with the invention, a method of compiling a computer program utilizes the probabilities from a number of applicable heuristics to determine what branches of a program are most likely to be taken. In one aspect of the invention, a table of probabilities associated with a number of heuristic predictions is stored. The compiler generates intermediate code, stores it, partitions it into basic blocks, and then stores the basic blocks in a data structure that includes branches to other basic blocks. For the branches, the heuristic predictions that apply to a branch are determined. The probabilities associated with the heuristics that apply to a branch are combined to compute a probability of the branch being taken by the program. Object code is then generated and stored in an order based on the branch probabilities.

In another aspect of the invention, branch probabilities are used for computing branch frequencies and block frequencies. The branch frequencies, derived from the branch probabilities, may then be a more direct basis for storing the object code in a given order to optimize program execution.

In another aspect of the invention, function invocation frequencies and function call frequencies are computed and combined to obtain global call frequencies for calling and called function pairs f,g. These global call frequencies are then used by the compiler to order and store functions within multiple object files to improve the likelihood that a calling function and its called function are located within a same virtual memory page. This reduces the need for disk access, further increasing the speed of program execution.

The foregoing and other objects, features, and advantages of the invention will become more apparent from the following detailed description of a preferred embodiment and the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer system that may be used to implement a method and apparatus embodying the invention.

FIG. 2 is a data flow diagram of a compiler embodying the invention.

FIG. 3 is a control flow graph of a procedure, including a diagram of a basic block data structure according to the invention.

FIG. 4 is a flowchart showing the steps for computing branch probabilities from heuristic probabilities.

FIGS. 5A-C are example control flow graphs for computing block and branch frequencies.

FIG. 6 is another example control flow graph of a procedure.

FIG. 7 is a flowchart showing the steps for storing the object code in an order based on frequencies.

FIG. 8 is a data structure diagram of a call graph of the functions within a program.

FIG. 9 is an example control flow graph.

FIG. 10 is the example control flow graph with branch probabilities added.

FIG. 11 is the example control flow graph with block and branch frequencies added.

FIG. 12A is an example call graph showing local call frequencies.

FIG. 12B is the example call graph showing global call frequencies and invocation frequencies.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

FIG. 1 is a block diagram of a computer system 20 which is used to implement a method and apparatus embodying the invention. Computer system 20 includes as its basic elements a computer 22, input device 24 and output device 26.

Computer 22 generally includes a central processing unit (CPU) 28 and a memory system 30 that communicate through a bus structure 32. CPU 28 includes an arithmetic logic unit (ALU) 33 for performing computations, registers 34 for temporary storage of data and instructions and a control unit 36 for controlling the operation of computer system 20 in response to instructions from a computer program such as an application or an operating system.

Memory system 30 generally includes high-speed main memory 38 in the form of a medium such as random access memory (RAM) and read only memory (ROM) semiconductor devices and secondary storage 40 in the form of a medium such as floppy disks, hard disks, tape, CD-ROM, etc. and other devices that use optical or magnetic recording material. Main memory 38 stores programs such as a computer's operating system and currently running application programs. Main memory 38 also includes video display memory for displaying images through a display device.

Input device 24 and output device 26 are typically peripheral devices connected by bus structure 32 to computer 22. Input device 24 may be a keyboard, modem, pointing device, pen, or other device for providing input data to the computer. Output device 26 may be a display device, printer, sound device or other device for providing output data from the computer.

It should be understood that FIG. 1 is a block diagram illustrating the basic elements of a computer system; the figure is not intended to illustrate a specific architecture for a computer system 20. For example, no particular bus structure is shown because various bus structures known in the field of computer design may be used to interconnect the elements of the computer system in a number of ways, as desired. CPU 28 may be comprised of a discrete ALU 33, registers 34 and control unit 36 or may be a single device in which these parts of the CPU are integrated together, such as in a microprocessor. Moreover, the number and arrangement of the elements of the computer system may be varied from what is shown and described in ways known in the art (e.g., multiple CPUs, client-server systems, computer networks, etc.).
FIG. 2 is a data flow diagram of a compiler 50 embodying the invention. The compiler includes a front end 52 that receives as input a source file 54 which includes source code written in a high level language such as C. The front end of the compiler is conventional in nature and may include, for example, a lexical analyzer, a syntax analyzer and a semantic analyzer. The front end 52 also includes a code generator that generates intermediate code from the source code based on these analyses. This intermediate code is then stored in a file 56 in memory system 30. The file 56 is preferably permanently stored in some type of secondary storage 40 and copied into main memory 38 as needed for processing. However, storage in secondary storage 40 is not required.

A back end of the compiler 50 is indicated by dashed block 58. The compiler back end 58 includes a code analysis portion 60, an optimization portion 62 and a code generator portion 64. The code analysis portion 60 analyzes the intermediate code and partitions it into basic blocks. Typically, each function or procedure in the intermediate code is represented by a group of related basic blocks. As understood in the art, a basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end without branching except at the end. For example, a block of assignment statements is a basic block because there is no possibility of branching in the middle of the block. On the other hand, a block of intermediate code statements that includes a branch in the middle of the block (generated perhaps by an IF statement in the source code) is not a basic block because of the possibility of branching from the middle of the block. The basic blocks of intermediate code are then stored by compiler 50 into basic block data structures 70.

FIG. 3 is a control flow graph of a procedure (i.e., a function) that includes a diagram of a basic block data structure 70. The graph is comprised of a group of nodes, which are the basic blocks of a function, and edges, which are the branches. Node B1 is the initial node of the function; it is the block whose leader is the first statement of the function. There are edges from block B1 to blocks B2 and B3, indicating branches to these following blocks. Block B2 is a successor of block B1, and B1 is a predecessor to B2. Similarly, blocks B4 and B5 are successors to B2.

As stored in the memory system 30, data structure 70 includes a number of relevant fields for holding data elements and their attributes. These include a field 70a for storing the sequence of intermediate code instructions that comprise the basic block. A field 70b contains a value for the block frequency, which, as will be explained, is the number of times a basic block typically executes within the procedure. Fields 70c contain pointers to other basic blocks. These pointers indicate the possible branches from the basic block to other basic blocks. For example, the data structure for block B2 includes in fields 70c pointers to blocks B4 and B5. Attributes for fields 70c include a branch probability value in a field 70d and a branch frequency value in a field 70e. The branch probability is an estimate of the likelihood that the branch will be taken. The branch frequency is the number of times the branch is taken, and is the product of the branch probability and block frequency.

Heuristic Predictions

Returning to FIG. 2, the optimization portion 62 of compiler back end 58 performs a number of conventional optimizations. However, it also includes branch analysis 62a for performing novel optimizations in accordance with the invention. The branch analysis is applied to the basic block data structures to determine the block and branch probabilities and frequencies described above. As part of the branch analysis, compiler back end 58 accesses a look up table 72 in memory system 30 which contains probabilities associated with a plurality of heuristic predictions. These probabilities are derived from empirical data from executed computer programs. While various groups of heuristics may be employed in compiler 50, the following heuristics are presently used:

Loop branch heuristic (LBH). Predict as taken an edge back to a loop's head. Predict as not taken an edge exiting a loop.

Pointer heuristic (PH). Predict that a comparison of a pointer against null or of two pointers will fail.

Opcode heuristic (OH). Predict that a comparison of an integer for less than zero, less than or equal to zero, or equal to a constant, will fail.

Guard heuristic (GH). Predict that a comparison in which a register is an operand and the register is used before being defined in a successor block, and the successor block does not post-dominate, will reach the successor block.

Loop exit heuristic (LEH). Predict that a comparison in a loop in which no successor is a loop head will not exit the loop.

Loop header heuristic (LHH). Predict a successor that is a loop header or a loop pre-header and does not post-dominate will be taken.

Call heuristic (CH). Predict a successor that contains a call and does not post-dominate will not be taken.

Store heuristic (SH). Predict a successor that contains a store and does not post-dominate will not be taken.

Return heuristic (RH). Predict a successor that contains a return will not be taken.

Table 1 below shows the contents of look up table 72.

TABLE 1

Heuristic             Heuristic Probability
Loop branch (LBH)     88%
Pointer (PH)          60%
Opcode (OH)           84%
Guard (GH)            62%
Loop exit (LEH)       80%
Loop header (LHH)     75%
Call (CH)             78%
Store (SH)            55%
Return (RH)           72%

Other heuristics that may be used are the following:

LoopIndex heuristic (LIH). Predict that a comparison of an integer value equal to a loop index variable will fail.

Pointer Guard heuristic (PGH). Predict that a comparison in which a pointer is an operand and the pointer is used before being defined in a successor block, and the successor block does not post-dominate, will reach the successor block. This heuristic overrides Pointer and Guard heuristics.

OPvar heuristic (OPH). Predict the comparison of two variables for equal will fail.

OPMax heuristic (OPMXH). Predict that a comparison of a variable greater than, or greater than or equal to, a variable with a name that contains one of the following patterns will fail: max, most, largest, biggest, size, upper, length.

OPMin heuristic (OPMNH). Predict that a comparison of a variable less than, or less than or equal to, a variable with a name that contains one of the following patterns will fail: min, least, smallest, low.

Abnormal heuristic (AH). Predict that a successor that contains a call to a function whose name contains one of
the following patterns is not taken: warn, exit, abort, abend, aberr, quit, error, kill, fatal, longjmp. This heuristic overrides the Call heuristic.

Abort heuristic (ABH). Predict that the successor that contains a call to a function whose name is one of the following strings is not taken: exit, _exit, abort, abend, aberr, quit, error, perror, kill, fatal, longjmp, siglongjmp. This heuristic overrides Call and Abnormal heuristics.

Debug heuristic (DH). Predict that a comparison of a debug variable for non-zero will fail. A variable whose name contains one of the following patterns is a debug variable: error, debug, verbose, trace.

Errno heuristic (EH). Predict that a comparison of errno for non-zero will fail.

CMP heuristic (CMPH). Predict that a call to a function whose name contains one of the following patterns returns a non-zero value: cmp, compar.

Status heuristic (STH). Predict that a system call will return normal status.

Range heuristic (RGH). Predict that a check for a variable out of a range will fail.

When a heuristic predicts exactly one successor of a branch as taken, the heuristic applies to the branch. If a heuristic applies to a branch and the predicted taken branch is actually taken, the heuristic hits. The percentage of predictions that hit is the heuristic's hit rate. A hit can be treated as the successful outcome of a binary experiment. By repeating the experiment N times, M (M ≤ N) true outcomes and N-M false outcomes are obtained. If N is reasonably large, the hit ratio M/N approximates the probability of a successful outcome. A heuristic's hit rate is a good estimate of the probability that the predicted branch will be taken at run time. For example, if PH's hit rate is 60%, PH predicts that the branch will fail 60% of the time when heuristic PH applies to a branch.

Computing Branch Probabilities

The compiler 50 computes branch probabilities from identified heuristic predictions using table 72 and basic blocks data structure 70. FIG. 4 is a flowchart that illustrates how this is done. The following steps are applied to the basic blocks in each function in the program being compiled. As an initial step, branch analysis 62a analyzes the sequence of instructions in field 70a of a basic block and its successor blocks data structures 70 to determine which heuristics apply to the branch (80). The applicable heuristics are then looked up in table 72 to determine their associated probabilities (82). If more than one heuristic prediction applies, the associated probabilities of the applicable heuristics are combined to compute a probability of the branch being taken by the program (84). (If only one heuristic prediction applies, its probability is stored as the branch probability.) The computed branch probability is then stored in the basic block data structure field 70d. As will be explained, these branch probabilities may be used in a number of ways for optimizing the object code of the program. For example, the branch probabilities may be used as a basis for storing the object code in a desired order in a file so that the most frequently executed portion of the object code is stored together. This promotes the parallel execution of instructions in CPUs that have this capability.

The applicable heuristic probabilities may be combined in a number of ways. In the preferred embodiment, they are combined according to the following-described algorithm, which is derived from the Dempster-Shafer theory of evidence described in A Mathematical Theory of Evidence, Princeton University Press, 1976. This algorithm combines probability predictions of all applicable heuristics into a branch probability.

Combining heuristic probabilities from several heuristics produces a branch probability that is more accurate than if only one heuristic probability alone were used. For example, in the following code:

    if (...) ... else {...; return;}

both the OH and RH heuristics apply. OH suggests that the else-branch is taken, but RH claims that the then-branch is taken. The prior approach of ordering heuristics by priority resolves the conflict by selecting OH, which reasonably predicts the else-branch, but it results in an 84% probability for this branch. The negative evidence from RH arguably should reduce the probability.

Dempster-Shafer theory provides a mathematical technique for combining values such as these heuristic probabilities into a meaningful prediction of the probability of an outcome. It starts from a basic probability in the range [0, 1]. This value is the degree to which evidence supports a hypothesis. For branch probability estimation, the hypothesis is: "branch b is taken" or "a branch other than b is taken (b is not taken)." The evidence is that a heuristic predicts the branch. The basic probability is the heuristic probability.

If more than one heuristic supports or denies a hypothesis, Dempster-Shafer theory provides an elegant way to combine heuristic probabilities. Assume an event has a set of k exhaustive and mutually exclusive possible outcomes $A = \{A_1, A_2, \ldots, A_k\}$. Each subset of A has a corresponding hypothesis that the events in the subset occur. A piece of evidence assigns a value in [0, 1] to every hypothesis (subset of A), so the values for the evidence sum to 1. This value indicates the likelihood that the event occurs. The empty set is assigned 0. This assignment is called a basic probability assignment (denoted by function m). For example, a branch $b \rightarrow \{b_1, b_2, \ldots, b_k\}$ has k exhaustive and mutually exclusive outcomes $A = \{b_1, b_2, \ldots, b_k\}$. If a heuristic predicts the probability of taking $b_i$ is u and the probability of not taking $b_i$ is 1-u, the following basic probability assignment is obtained: $m_1(\{b_i\}) = u$ and $m_1(A - \{b_i\}) = 1 - u$. If another heuristic predicts the probability of taking $b_j$ is v, another basic probability assignment is obtained:

$$m_2(\{b_j\}) = v \quad \text{and} \quad m_2(A - \{b_j\}) = 1 - v.$$

Let $m_1$ and $m_2$ be two basic probability assignments. The Dempster-Shafer algorithm computes a new combined assignment, denoted $m_1 \oplus m_2$, that combines the evidence from both assignments. For a subset B of A:

$$m_1 \oplus m_2(B) = \frac{\sum_{X \cap Y = B} m_1(X)\, m_2(Y)}{\sum_{U \cap W \neq \emptyset} m_1(U)\, m_2(W)}$$

where X and Y run over all subsets of A whose intersection is B and U and W are subsets of A with at least one element in common. To continue the example from above, when $b_i = b_j$ (two heuristics predict the same outcome) only the subsets $\{b_i\}$ and $A - \{b_i\}$ may have non-zero basic probabilities because all other subsets, S, have $m_1(S)$ and $m_2(S)$ equal to zero. To find the combined basic probability $m_1 \oplus m_2(\{b_i\})$, note that $m_1(\{b_i\})\, m_2(\{b_i\})$ produces uv; for other sets X and Y, if their intersection is $\{b_i\}$, then $m_1(X)\, m_2(Y)$ is zero. Furthermore, $m_1(A - \{b_i\})\, m_2(A - \{b_i\}) = (1-u)(1-v)$, so:

$$m_1 \oplus m_2(\{b_i\}) = \frac{uv}{uv + (1-u)(1-v)}$$
$$m_1 \oplus m_2(A - \{b_i\}) = \frac{(1-u)(1-v)}{uv + (1-u)(1-v)}$$

In this case, $m_1 \oplus m_2(\{b_i\}) > m_1(\{b_i\})$ if and only if $m_2(\{b_i\}) > 0.5$, and $m_1 \oplus m_2(\{b_i\}) > m_2(\{b_i\})$ if and only if $m_1(\{b_i\}) > 0.5$. This shows that an estimation that $b_i$ occurs less than half of the time lowers the probability of another prediction of the same outcome.

Consider the case when $b_i \neq b_j$ (the heuristics predict different outcomes). If k = 2, one has the same case as $b_i = b_j$ by using $\{b_j\} = A - \{b_i\}$. If 2 < k:

$$m_1 \oplus m_2(\{b_i\}) = \frac{u(1-v)}{1 - uv}$$

$$m_1 \oplus m_2(\{b_j\}) = \frac{v(1-u)}{1 - uv}$$

$$m_1 \oplus m_2(A - \{b_i, b_j\}) = \frac{(1-u)(1-v)}{1 - uv}$$

In this case, $m_1 \oplus m_2(\{b_i\}) \geq m_1(\{b_i\})$ if and only if $m_1(\{b_i\}) = 1$ or 0. This shows that a contradictory prediction lowers the probability unless one prediction is certain (1 or 0).
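The general combination rule can be checked numerically. The following is a minimal Python sketch, not the patent's implementation: the function names `combine` and `heuristic_assignment`, the representation of a basic probability assignment as a dictionary from `frozenset` to mass, and the sample values 0.8 and 0.6 are all assumptions made for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two basic probability assignments.

    Each assignment maps a frozenset of outcomes to its mass; subsets
    that are absent from the dictionary have mass zero.
    """
    raw = {}
    for (x, mass_x), (y, mass_y) in product(m1.items(), m2.items()):
        b = x & y
        if b:  # pairs with an empty intersection are discarded
            raw[b] = raw.get(b, 0.0) + mass_x * mass_y
    total = sum(raw.values())  # the normalizing denominator
    if total == 0.0:
        raise ValueError("the two assignments are completely contradictory")
    return {b: mass / total for b, mass in raw.items()}


def heuristic_assignment(outcomes, predicted, p):
    """Mass p on {predicted} and mass 1 - p on the set of all other outcomes."""
    return {
        frozenset([predicted]): p,
        frozenset(outcomes) - {predicted}: 1.0 - p,
    }


if __name__ == "__main__":
    # Hypothetical three-way branch with two heuristics that disagree.
    A = {"b1", "b2", "b3"}
    m = combine(heuristic_assignment(A, "b1", 0.8),
                heuristic_assignment(A, "b2", 0.6))
    for subset, mass in sorted(m.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(subset), round(mass, 4))
```

With u = 0.8 and v = 0.6 the three resulting masses agree with u(1-v)/(1-uv), v(1-u)/(1-uv) and (1-u)(1-v)/(1-uv), and the mass on b1 falls below 0.8, as the text states for contradictory predictions.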

As a concrete example, suppose b-^b^ b 2 } initially (in 
the absence of a prediction) has an equal probability of 
branching to b t and b 2 (m 1 ({b t })=m l ({b 2 })=0.5). If a heu- 
ristic predicts that b-4bj occurs 70% of the time (ro^b,}) 
=0.7 and m 2 ({b 2 })=0.3), the combined probabilities are: 

^ * r • i\ 0.5x0.7 rt _ 

^O^'P" 0,5x0.7+03x0.3 =0.7 



m „ 0.5 x 0.3 

mi©m2«fr2» - Q.5X0.7+0.5X0J 



= 0.3 



0.7x0.6 
: 0.7x0.6 +03X0.4 



= 0.78 



25 



30 



35 



Now suppose another heuristic estimates that b-»bl is 
taken 60% of the time (m^bj)^ and m 3 ({b 2 })=0.4). 
The estimate then becomes: 



40 



10 



TABLE 2 



50 



/niSm*Bm 3 ({fc})= 0 . 7 J;t*^ xOA =022 



The second heuristic increased the probability that b x is 
taken from 0.7 to 0.78. This process can be repeated, in any 
order, to incorporate other heuristics as the operator © is 
associative. 

Table 2 below shows pseudocode that describes this 
algorithm as it is executed by the branch analysis 62a of 
compiler back end 58. The algorithm computes the prob- 
ability for two-way branches by combining predictions from 
all applicable heuristics. For multiway (>2) branches, it 
assigns equal probability to each outcome since no heuristics 
predicts these branches. If heuristics are developed for 
multiway branches, the algorithm can use the general 
Dempster-Shafer algorithm to combine the basic branch 
probabilities. 

A similar algorithm can also combine the probabilities 
from dynamic profiles. A common way to combine these 
profiles is to add counts for each branch, which weights a 
profile in proportion to its execution length. By first con- 
verting counts into predictions of branch probabilities, 65 
Dempster-Shafer theory can combine profiles without this 
bias. 



TABLE 2

Input: Control-flow graph G for function. Each node is a basic
block and an edge b_i->b_j represents a branch from block b_i to b_j.
For each heuristic H, the predicted taken probability is
taken_prob(H), and the not taken probability is not_taken_prob(H).

Output: Assignment of a branch probability prob(b_i->b_j) to each
edge b_i->b_j in G.

Process:
foreach block b with n successors
        and m back edge successors (m <= n) do
    if n = 0 then // No successors
        continue;
    else if b calls exit() then
        foreach successor s of b do
            prob(b->s) = 0.0; // Never reach successors
    else if m > 0 and m < n then
        // Both back edges and exit edges
        foreach back edge successor s of b do
            prob(b->s) = taken_prob(LBH) / m;
        foreach exit edge successor s of b do
            prob(b->s) = not_taken_prob(LBH) / (n - m);
    else if m > 0 or n != 2 then
        // Only back edges, or not a 2-way branch
        foreach successor s of b do
            prob(b->s) = 1.0 / n;
    else
        // None of the above
        let s_1 and s_2 be the successors of b
        prob(b->s_1) = prob(b->s_2) = 0.5
        foreach heuristic H that applies do
            Assume H predicts (b->s_1) taken,
                and (b->s_2) not taken
            d = prob(b->s_1) x taken_prob(H)
                + prob(b->s_2) x not_taken_prob(H);
            prob(b->s_1) = prob(b->s_1) x taken_prob(H) / d;
            prob(b->s_2) = prob(b->s_2) x not_taken_prob(H) / d;
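The two-way case of the algorithm above can be sketched in Python. This is a minimal illustration, not the compiler's implementation: the function names `combine` and `branch_probability` are hypothetical, and the sketch assumes every heuristic is expressed as the taken probability of the same successor and lies strictly between 0 and 1.

```python
from functools import reduce


def combine(p, q):
    """Combine two taken-probabilities for the same two-way branch.

    This is the two-outcome form of the Dempster-Shafer rule:
    p*q / (p*q + (1-p)*(1-q)).
    """
    d = p * q + (1.0 - p) * (1.0 - q)
    return p * q / d


def branch_probability(taken_probs):
    """Fold all applicable heuristics into one taken probability.

    The starting value 0.5 mirrors prob(b->s_1) = prob(b->s_2) = 0.5;
    combining 0.5 with q simply yields q.
    """
    return reduce(combine, taken_probs, 0.5)


# Two heuristics that agree reinforce each other.
print(round(branch_probability([0.7, 0.6]), 2))
```

Because the rule is a product of evidence followed by normalization, the heuristics can be supplied in any order.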



Computing Block Frequencies and Branch Frequencies 

There may be a number of ways that object code may be
optimized based on the determined branch probabilities. In
the preferred embodiment, the branch probabilities are the
basis for computing block frequencies and branch frequencies.
After computing branch probabilities, compiler 50
calculates intra-procedural (or local) basic block and control
flow graph (CFG) edge frequencies by propagating branch
probabilities over a single procedure's control-flow graph.
The frequency of a branch b_i->b_j is the frequency of block
b_i times the branch probability of b_i->b_j. The frequency of
block b_i is the sum of the frequencies of its incoming edges. Let
bfreq(b_i) be the frequency of block b_i and freq(b_i->b_j) be the
edge frequency of b_i->b_j. Assume pred(b_i) is the set of
predecessor blocks of b_i. The following flow equations state
this relation precisely:



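$$
\begin{aligned}
\mathrm{bfreq}(b_i) &= 1 && \text{(if } b_i \text{ is the entry block)} \\
\mathrm{bfreq}(b_i) &= \sum_{b_p \in \mathrm{pred}(b_i)} \mathrm{freq}(b_p \to b_i) && \text{(otherwise)} \\
\mathrm{freq}(b_i \to b_j) &= \mathrm{bfreq}(b_i) \times \mathrm{prob}(b_i \to b_j)
\end{aligned}
$$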



For a flow graph without cycles, these equations can be
solved top-down in a single pass. When a graph contains
cycles, these equations are mutually recursive and must be
solved by finding a least fixed point. FIGS. 5A-C and 6 are
graph representations of several procedures' basic blocks.
Referring to FIG. 5A, consider first a structured flow graph
in which a single loop head dominates a loop body (this
could be a single loop or nested loops that share the same
head). In the flow graph, block b_0 is the loop head, in_freq(b_0)
is the total frequency of the edges (excluding the back
edges) entering b_0, and blocks b_1, b_2, ..., b_k contain back
edges leading to b_0. Since b_0 is the only entry to the loop,
one can propagate bfreq(b_0), without recursion, to b_1, b_2,
..., b_k and obtain r_i for i = 1, ..., k, where r_i is the probability
that control passes from b_0 to b_i. From this, one finds:



$$
\begin{aligned}
\mathrm{bfreq}(b_0) &= \mathrm{in\_freq}(b_0) + \sum_{i=1}^{k} \mathrm{freq}(b_i \to b_0) \\
&= \mathrm{in\_freq}(b_0) + \sum_{i=1}^{k} \bigl(\mathrm{bfreq}(b_i) \times \mathrm{prob}(b_i \to b_0)\bigr) \\
&= \mathrm{in\_freq}(b_0) + \sum_{i=1}^{k} \bigl(\mathrm{bfreq}(b_0) \times r_i \times \mathrm{prob}(b_i \to b_0)\bigr) \\
&= \mathrm{in\_freq}(b_0) + \mathrm{bfreq}(b_0) \times \sum_{i=1}^{k} \bigl(r_i \times \mathrm{prob}(b_i \to b_0)\bigr)
\end{aligned}
$$

Let

$$p_i = r_i \times \mathrm{prob}(b_i \to b_0), \qquad \mathrm{cp}(b_0) = \sum_{i=1}^{k} \bigl(r_i \times \mathrm{prob}(b_i \to b_0)\bigr) = \sum_{i=1}^{k} p_i;$$

if 0 <= cp(b_0) < 1, one has:

$$\mathrm{bfreq}(b_0) = \mathrm{in\_freq}(b_0) + \mathrm{bfreq}(b_0) \times \mathrm{cp}(b_0), \quad\text{hence}\quad \mathrm{bfreq}(b_0) = \frac{\mathrm{in\_freq}(b_0)}{1 - \mathrm{cp}(b_0)}$$

In this derivation, p_i is the probability that control goes
from b_0 to b_0 through block b_i, and cp(b_0) is the probability
along all paths that control goes from b_0 to b_0. This cp(b_0)
is called the cyclic probability of block b_0. To find the cyclic
probability, first assume b_0 executes once, propagate
branch probabilities from b_0 to all back edges leading to b_0,
and sum the probabilities of the back edges.

Applying this formula to the examples in FIGS. 5A-C, one
gets:

$$\mathrm{bfreq}(b_0) = \frac{1}{1 - 0.5 \times 0.88 - 0.5 \times 0.88} = 8.33$$

for the flow graph in FIG. 5B, and:

$$\mathrm{bfreq}(b_0) = \frac{1}{1 - 0.88 - 0.88 \times 0.12 - 0.88 \times 0.12 \times 0.12} = 578.70$$

for the flow graph in FIG. 5C.

For a loop that terminates, cp(b_0) < 1. If the loop appears
not to terminate, one could have cp(b_0) = 1. When this
happens, cp(b_0) can be easily set to a value (less than 1) that
represents the cyclic probability for spin loops.

Now consider in FIG. 6 a procedure's flow graph with two
loop heads, one of which is nested in the other. For this flow
graph, one first finds the cyclic probability of the inner loop
and then treats the outer loop in the same manner as a
single-level loop, except that one uses the formula

$$\mathrm{bfreq}(b_{inner}) = \frac{\mathrm{in\_freq}(b_{inner})}{1 - \mathrm{cp}(b_{inner})}$$

to find the frequency of the inner loop head, where b_inner is
the head block of the inner loop structure and cp(b_inner) is
b_inner's cyclic probability.

If a flow graph is reducible, every loop head dominates
the blocks in the loop. The method described above finds the
correct branch frequencies for these flow graphs. The innermost
loop is visited first and the cyclic probabilities of inner
loops are used to compute frequencies for the outer loops.

The algorithm described above computes branch and
block frequencies for a procedure. Pseudocode representing
the steps taken by compiler 50 for executing the algorithm
is contained in Table 3 below. It assumes that the flow graph
is reducible. The algorithm also terminates for non-reducible
flow graphs, although the resulting estimates may be less
accurate.

TABLE 3



Input: Control-flow graph G for function, in which each node is
a basic block and each edge b_i->b_j represents a branch from block
b_i to block b_j. Each edge b_i->b_j has branch probability
prob(b_i->b_j).

Output: Assignments of frequency freq(b_i->b_j) to edge b_i->b_j and
bfreq(b) to block b.

Subroutine: propagate_freq(b, head)
    if b has been visited then
        return;
    // 1. find bfreq(b)
    if b = head then
        bfreq(b) = 1;
    else
        foreach predecessor b_p of b do
            if b_p is not visited and (b_p->b) is not a back edge then
                return;
        bfreq(b) = 0;
        cyclic_probability = 0;
        foreach predecessor b_p of b do
            if (b_p->b) is back edge to loop head b then
                cyclic_probability += back_edge_prob(b_p->b);
            else
                bfreq(b) += freq(b_p->b);
        if (cyclic_probability > 1 - epsilon) then
            cyclic_probability = 1 - epsilon;
        bfreq(b) = bfreq(b) / (1 - cyclic_probability);
    // 2. calculate the frequencies of b's out edges
    mark b as visited
    foreach successor b_i of b do
        freq(b->b_i) = prob(b->b_i) x bfreq(b);
        // update back_edge_prob(b->b_i) so it
        // can be used by outer loops to calculate
        // cyclic_probability of inner loops
        if b_i = head then
            back_edge_prob(b->b_i) = prob(b->b_i) x bfreq(b);
    // 3. propagate to successor blocks
    foreach successor b_i of b do
        if (b->b_i) is not back edge then
            propagate_freq(b_i, head);

Process:
    foreach edge do
        back_edge_prob(edge) = prob(edge);
    foreach loop from inner-most to out-most do
        let head be the head block of the loop
        mark all blocks reachable from head as not visited
            and mark all other blocks as visited
        propagate_freq(head, head);
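The propagation in Table 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the compiler's code: the function name `block_frequencies` and its parameters are hypothetical, the caller is assumed to supply the set of back edges and the loop heads ordered from inner-most to outer-most with the entry block last, and edge frequencies that have not been computed yet are treated as 0.

```python
def block_frequencies(succ, back_edges, loop_heads, epsilon=1e-6):
    """succ maps block -> {successor: branch probability}.

    back_edges is a set of (src, dst) pairs; loop_heads lists the
    heads inner-most first, ending with the entry block.
    Returns (bfreq, freq) dictionaries.
    """
    preds = {}
    for b, outs in succ.items():
        preds.setdefault(b, [])
        for s in outs:
            preds.setdefault(s, []).append(b)
    back_edge_prob = {(b, s): p for b, outs in succ.items() for s, p in outs.items()}
    bfreq, freq = {}, {}

    def reachable(head):
        seen, stack = set(), [head]
        while stack:
            b = stack.pop()
            if b not in seen:
                seen.add(b)
                stack.extend(succ.get(b, {}))
        return seen

    def propagate(b, head, visited):
        if b in visited:
            return
        # 1. find bfreq(b)
        if b == head:
            bfreq[b] = 1.0
        else:
            for p in preds[b]:
                if p not in visited and (p, b) not in back_edges:
                    return  # wait until all forward predecessors are done
            total, cyclic = 0.0, 0.0
            for p in preds[b]:
                if (p, b) in back_edges:
                    cyclic += back_edge_prob[(p, b)]
                else:
                    total += freq.get((p, b), 0.0)
            cyclic = min(cyclic, 1.0 - epsilon)
            bfreq[b] = total / (1.0 - cyclic)
        # 2. frequencies of b's out edges
        visited.add(b)
        for s, p in succ.get(b, {}).items():
            freq[(b, s)] = p * bfreq[b]
            if s == head:
                back_edge_prob[(b, s)] = p * bfreq[b]
        # 3. propagate to successors along forward edges
        for s in succ.get(b, {}):
            if (b, s) not in back_edges:
                propagate(s, head, visited)

    for head in loop_heads:
        visited = set(preds) - reachable(head)
        propagate(head, head, visited)
    return bfreq, freq
```

Passing the entry block as the last "loop head" makes the final pass cover the whole function, which is one way to obtain frequencies relative to a single invocation.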



The block and branch frequencies, which are derived from
the branch probabilities as described above, are then stored
by compiler 50 in fields 70b and 70e, respectively, of the
basic block data structure 70 in memory system 30.

Optimization with Branch Probabilities, Block Frequencies 
and Branch Frequencies 

Referring again to FIG. 2, one optimization of the object
code is carried out by code generator 64 of the compiler,
which generates the object code from the intermediate code
for storage in an object file 94. FIG. 7 is a flowchart of the
steps taken by code generator 64 to optimize the object code.
Initially the code generator reorders the basic blocks of a
function (100). The order is determined from the branch
frequency information in data structure field 70e. For example,
with respect to FIG. 3, basic block B2 has branches to blocks
B4 and B5. Each of these branches has a frequency, stored
in field 70e. If the frequency of the branch between B2 and
B4 is greater than the frequency of the branch between B2
and B5, then the basic blocks may be rearranged to take
advantage of this frequency. One possible arrangement is to
place block B4 adjacent to block B2. Then, if the CPU
running the program prefetches instructions, it will prefetch
the instructions of block B4, which most likely follow the
instructions of block B2. Once the blocks are arranged, the
intermediate code is translated into assembly code that runs
on the target machine 20 (102). The assembly code is assembled
into object code (104). Of course, the intermediate code may be
translated directly into object code if the compiler is capable
of doing so. The object code is then stored in object file 94
(106). The code generator also generates a local call graph
96 for functions in the program from the intermediate code
and the block frequency information (107).
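One way to picture the block reordering step (100) is a greedy chain layout: starting at the entry block, keep placing the not-yet-placed successor reached by the most frequent branch. The sketch below is only an illustration of that idea; the text above does not prescribe a particular layout algorithm, and the name `layout_blocks` and its arguments are hypothetical.

```python
def layout_blocks(entry, edge_freq):
    """Return a block order in which hot successors follow their predecessors.

    edge_freq maps (src, dst) -> branch frequency.
    """
    # Collect blocks in first-seen order so leftovers are placed deterministically.
    blocks = []
    for src, dst in edge_freq:
        for b in (src, dst):
            if b not in blocks:
                blocks.append(b)

    placed, order = set(), []
    for start in [entry] + [b for b in blocks if b != entry]:
        b = start
        while b is not None and b not in placed:
            placed.add(b)
            order.append(b)
            # Candidate fall-through targets: unplaced successors of b.
            candidates = [(f, d) for (s, d), f in edge_freq.items()
                          if s == b and d not in placed]
            b = max(candidates)[1] if candidates else None
    return order


# B2 branches to B4 more often than to B5, so B4 is placed next to B2.
print(layout_blocks("B2", {("B2", "B4"): 0.7, ("B2", "B5"): 0.3}))
```

The frequencies 0.7 and 0.3 in the usage line are made-up values chosen only to show the effect.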
Computing Function Call Frequencies and Invocation Frequencies

Referring again to FIG. 2, the object files for the various
functions of a program are linked together and possibly with
other files such as libraries. Any external references are
resolved. As part of the linking, a linker 108 fills a data
structure 110 contained in memory from information in
object code file 94 and local call graph 96. The data structure
comprises a global call graph with function call frequencies.
Data structure 110 is shown in more detail in FIG. 8. It
includes an array 112 of the program's functions whose
elements each point to a linked list 114 of called functions.
Each element of the array 112 has the structure indicated at
116. Data structure 116 includes a field 116a that contains
the function's invocation frequency (how often it is called);
a field 116b that contains a pointer to the function's code; a
field 116c that contains a pointer to a list of children
(functions called by this function); and a field 116d that
contains the function's ID. The elements of linked list 114
are data structures such as indicated at 118. Data structure
118 includes a field 118a that contains the local call
frequency for the function; a field 118b that contains a global
call frequency for the function; a field 118c containing an
index number identifying the structure as an element in the
array 112; and a field 118d containing a pointer to the next
called function in the linked list. The end of the list is
indicated by a null value in field 118d.
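The layout of data structure 110 can be mirrored by a small Python sketch. The class and attribute names below are hypothetical and chosen only to line up with fields 116a-116d and 118a-118d; the actual in-memory representation used by the linker is not specified beyond the description above.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class CalledFunction:
    """One element of linked list 114 (structure 118)."""
    local_call_freq: float                    # field 118a
    global_call_freq: float                   # field 118b
    callee_index: int                         # field 118c: index into array 112
    next: Optional["CalledFunction"] = None   # field 118d: None ends the list


@dataclass
class FunctionEntry:
    """One element of array 112 (structure 116)."""
    invocation_freq: float                    # field 116a
    code: Optional[bytes]                     # field 116b: stands in for a code pointer
    children: Optional[CalledFunction]        # field 116c: head of the callee list
    func_id: str                              # field 116d


def callees(entry: FunctionEntry) -> Iterator[CalledFunction]:
    """Walk the linked list of functions called by entry."""
    node = entry.children
    while node is not None:
        yield node
        node = node.next
```

A linker pass could then iterate over `callees(array[i])` to fill in each `global_call_freq` once the invocation frequencies are known.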

The data stored in data structure 110 is used by linker 108
to order the functions of the program. The local block
frequencies are used for calculating the local frequency of
calls on other functions. These local call frequencies are then
propagated along call-graph edges to compute inter-procedural
(or global) function invocation frequencies and
call frequencies. Finally, global basic block and edge
frequencies may be obtained by multiplying each local
frequency by its function's global invocation frequency.

The local call frequency is the number of times that a
function f calls a function g, assuming one invocation of f.
This information is readily available from the function's
block frequencies, computed previously. If function f calls g
in blocks b_1, ..., b_k of function f, the local call frequency of
f calling g is the combination of the frequencies of these
blocks, such as by summing. The invocation frequency of a
function f is the frequency with which function f itself is
called during the execution of the program. The global call
frequency of function f calling g is then the number of times
that f calls g during all invocations of f, which is a
combination of the local call frequency and the invocation
frequency of f. The combination in the preferred embodiment
is the product of the local call frequency and the invocation
frequency. It should be understood that function f may call
a number of functions such as g, h, i, etc., with function g
described above being an example.
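The two definitions above reduce to a sum and a product. The following sketch assumes a hypothetical representation in which each call site is a (block, callee) pair; the helper names are illustrative only.

```python
def local_call_freq(block_freq, call_sites, callee):
    """Sum the frequencies of the blocks of f that contain a call to callee.

    block_freq maps block -> frequency for one invocation of f;
    call_sites is a list of (block, called function) pairs in f.
    """
    return sum(block_freq[b] for b, g in call_sites if g == callee)


def global_call_freq(lfreq, invocation_freq):
    """Product of the local call frequency and f's invocation frequency."""
    return lfreq * invocation_freq


# A function whose block b5 (frequency 0.05) contains its only call to printf.
lf = local_call_freq({"b0": 1.0, "b5": 0.05}, [("b5", "printf")], "printf")
print(lf, global_call_freq(lf, 1.0))
```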

Computing global call frequencies from local call
frequencies is similar to propagating branch probabilities in a
flow graph. Assume cfreq(f) is the number of times that
function f is called, lfreq(f,g) is the local frequency of f
calling g, and gfreq(f,g) is the global frequency of f calling
g. The flow equations relating local and global call
frequencies are:

$$
\begin{aligned}
\mathrm{cfreq}(f) &= 1 && \text{(} f \text{ is main function)} \\
\mathrm{cfreq}(f) &= \sum_{p \in \mathrm{pred}(f)} \mathrm{gfreq}(p, f) && \text{(otherwise)} \\
\mathrm{gfreq}(f, g) &= \mathrm{lfreq}(f, g) \times \mathrm{cfreq}(f)
\end{aligned}
$$

A call graph is not reducible when a recursive cycle in the
graph can be entered at several points. To handle these
cycles, the branch and block frequency computation algorithm
in Table 3 is modified. Each node that is the target of
a back edge is treated as a loop head, and the cyclic
probability for a loop head that is not the entry function is
calculated without using its descendants' cyclic probabilities.

Table 4 below contains pseudocode describing an algo- 
rithm executed by linker 108 for calculating global call and 
function invocation frequencies. 

TABLE 4 



Input: A call graph, each node of which is a procedure and each edge
f_i->f_j represents a call from function f_i to f_j. Edge f_i->f_j has local
call frequency lfreq(f_i->f_j).

Output: Assignments of global function call frequency gfreq(f_i->f_j)
to edge f_i->f_j and invocation frequency cfreq(f) to f.

Subroutine: propagate_call_freq(f, head, final)
    if f has been visited then
        return;
    // 1. find cfreq(f)
    foreach predecessor fp of f do
        if fp is not visited and (fp->f) is not a back edge then
            return;
    if f = head then
        cfreq(f) = 1;
    else
        cfreq(f) = 0;
    cyclic_probability = 0;
    foreach predecessor fp of f do
        if final and (fp->f) is a back edge then
            cyclic_probability += back_edge_prob(fp->f);
        else if (fp->f) is not a back edge
            cfreq(f) += gfreq(fp->f);
    if (cyclic_probability > 1 - epsilon) then
        cyclic_probability = 1 - epsilon;
    cfreq(f) = cfreq(f) / (1 - cyclic_probability);
    // 2. calculate global call frequencies for f's out edges
    mark f as visited;
    foreach successor fi of f do
        gfreq(f->fi) = lfreq(f->fi) x cfreq(f);
        // update back_edge_prob(f->fi) so it can be
        // used by the outer-most loop to calculate
        // cyclic_probability of inner loops
        if fi = head and not final then
            back_edge_prob(f->fi) = lfreq(f->fi) x cfreq(f);
    // 3. propagate to successor nodes
    foreach successor fi of f do
        if (f->fi) is not a back edge then
            propagate_call_freq(fi, head, final)

Process:
    foreach edge do
        back_edge_prob(edge) = lfreq(edge);
    foreach function f in reverse depth-first order do
        if f is a loop head then
            mark all nodes reachable from f as not visited
                and all others as visited.
            propagate_call_freq(f, f, false)
    mark all nodes reachable from entry func as not
        visited and others as visited.
    propagate_call_freq(entry func, entry func, true)
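A Python sketch of the call-graph propagation in Table 4 follows. It is an illustration under assumptions, not the linker's code: the name `call_frequencies` and its parameters are hypothetical, the caller supplies the back edges and the loop heads in reverse depth-first order, and global frequencies that have not been computed yet count as 0.

```python
def call_frequencies(lfreq, back_edges, loop_heads, entry, epsilon=1e-6):
    """lfreq maps (caller, callee) -> local call frequency.

    Returns (cfreq, gfreq): invocation frequency per function and
    global call frequency per call-graph edge.
    """
    succ, pred = {}, {}
    for f, g in lfreq:
        succ.setdefault(f, []).append(g)
        succ.setdefault(g, [])
        pred.setdefault(g, []).append(f)
        pred.setdefault(f, [])
    back_edge_prob = dict(lfreq)
    cfreq, gfreq = {}, {}

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            f = stack.pop()
            if f not in seen:
                seen.add(f)
                stack.extend(succ[f])
        return seen

    def propagate(f, head, final, visited):
        if f in visited:
            return
        # 1. find cfreq(f)
        for p in pred[f]:
            if p not in visited and (p, f) not in back_edges:
                return
        total = 1.0 if f == head else 0.0
        cyclic = 0.0
        for p in pred[f]:
            if (p, f) in back_edges:
                if final:
                    cyclic += back_edge_prob[(p, f)]
            else:
                total += gfreq.get((p, f), 0.0)
        cyclic = min(cyclic, 1.0 - epsilon)
        cfreq[f] = total / (1.0 - cyclic)
        # 2. global call frequencies for f's out edges
        visited.add(f)
        for g in succ[f]:
            gfreq[(f, g)] = lfreq[(f, g)] * cfreq[f]
            if g == head and not final:
                back_edge_prob[(f, g)] = lfreq[(f, g)] * cfreq[f]
        # 3. propagate along non-back edges
        for g in succ[f]:
            if (f, g) not in back_edges:
                propagate(g, head, final, visited)

    for head in loop_heads:
        propagate(head, head, False, set(succ) - reachable(head))
    propagate(entry, entry, True, set(succ) - reachable(entry))
    return cfreq, gfreq
```

With a single recursive cycle whose back edge has local frequency 1.52 and whose head calls into the cycle with local frequency 0.5, the first pass records 0.76 for the back edge, and the final pass divides the head's incoming frequency by 1 - 0.76.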



Optimization with Function Frequencies 

Other optimizations of the program may be performed by 
ordering functions within the object code to improve 
memory reference locality, a process performed by linker 
108 as shown at 122 in FIG. 2. Typically a machine with 
virtual memory has its memory system organized as pages, 
with the main memory 38 holding several virtual pages and 
secondary storage 40 such as disk storing other virtual 
pages. Each page represents part or all of an executing 
program. If an executing program must call a function not 
contained in the memory pages, a page fault results and the 
target machine must bring a page from secondary storage 
that contains the called function into a memory page. This 
considerably slows the execution of the program. However,
with the computed function call frequencies, the functions
can be ordered so that pairs f,g with higher global call
frequencies are identified and stored in an order to improve
the likelihood that they will be within the same memory
page when loaded into main memory. More function calls
are thus made to functions stored in main memory 38 rather 
than to functions stored in secondary storage 40, increasing 
the speed of program execution. 

The optimized object code produced by ordering the 
functions is then stored in an executable file 124. 
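A simple way to turn global call frequencies into a function order is to visit caller/callee pairs from the most to the least frequent and emit each function the first time it appears. The sketch below is one hypothetical policy for illustration; the text above does not fix a specific ordering algorithm, and the name `order_functions` and the sample frequencies are assumptions.

```python
def order_functions(gfreq):
    """Order functions so that members of hot (caller, callee) pairs sit close together.

    gfreq maps (caller, callee) -> global call frequency.
    """
    order = []
    hot_first = sorted(gfreq.items(), key=lambda kv: kv[1], reverse=True)
    for (caller, callee), _ in hot_first:
        for fn in (caller, callee):
            if fn not in order:
                order.append(fn)
    return order


# Illustrative frequencies only; the hottest pair is emitted first.
print(order_functions({
    ("main", "atoi"): 1.0,
    ("report_one", "printf"): 15.0,
    ("permute", "report_one"): 2.1,
}))
```

This greedy pass keeps the hottest pair adjacent but does not guarantee adjacency for every later pair; a page-aware placement would also need function sizes.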
An Example 

The calculation of branch probabilities, branch and block
frequencies and function invocation and call frequencies for
optimizing a program is illustrated by the following
example. Consider the atoi function in Table 5. FIG. 9 shows
its flow graph. It contains two non-loop branches and one
loop branch. Five heuristics apply to the first non-loop
branch b_0->{b_1, b_5}, because b_0 performs a comparison of
a pointer with NULL (PH), b_1 uses s without first defining
it (GH) and stores to *val (SH), and b_5 contains a call (CH)
and a return (RH). Table 6 contains the heuristic probabilities.
The combined probability of 0.95 for b_0->b_1 is much
higher than the probability estimated by any of the five
heuristics.

For the second non-loop branch b_1->{b_2, b_4}, both the
Loop Header (LHH) and Return (RH) heuristics apply. Table
7 summarizes the probabilities.

TABLE 5

/* permutation program: find
   all the permutations for {1, 2, ... max} */
main(argc, argv)
int argc;
char *argv[];
{
    int max;
    char *a;

    atoi(argv[1], &max);
    a = (char *) malloc(max);
    permute(a, 0, max);
}

permute(a, n, max)
char *a;
int n, max;
{
    if (n == max)
        report_one(a, max);
    else
        permute_next_pos(a, n, max);
}

permute_next_pos(a, n, max)
char *a;
int n, max;
{
    int i;

    for (i = 0; i < max; i++) {
        if (!in_prefix(i, a, n)) {
            a[n] = i;
            permute(a, n+1, max);
        }
    }
}

report_one(a, max)
char *a;
int max;
{
    int i;

    for (i = 0; i < max; i++)
        printf("%c", a[i] + '0');
    putchar('\n');
}

int in_prefix(i, a, n)
char *a;
int i, n;
{
    int found = 0, j;

    for (j = 0; j < n; j++)
        if (a[j] == i) {
            return 1;
        }
    return 0;
}

int atoi(char *s, int *val)
{
    int i;

    if (s != 0) {
        *val = 0;
        for ( ; *s; s++)
            *val = *val * 10 + *s - '0';
        return 1;
    } else {
        printf("Invalid Input!\n");
        return 0;
    }
}



TABLE 6

Heuristic    b_0->b_1    b_0->b_5

CH           .78         .22
RH           .72         .28
SH           .45         .55
PH           .60         .40
GH           .62         .38
Combined     .95         .05


TABLE 7

Heuristic    b_1->b_2    b_1->b_4

LHH          .75         .25
RH           .72         .28
Combined     .88         .12
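As a check on the combined row for branch b_1->{b_2, b_4}, the two-heuristic form of the combination rule (the form recited in claim 8, with u = 0.75 for LHH and v = 0.72 for RH) can be worked out directly. The intermediate values shown are computed here and do not appear in the tables.

```latex
\[
\frac{u\,v}{u\,v + (1-u)(1-v)}
  = \frac{0.75 \times 0.72}{0.75 \times 0.72 + 0.25 \times 0.28}
  = \frac{0.54}{0.61} \approx 0.885,
\qquad
1 - 0.885 = 0.115
\]
```

To two decimal places these agree with the combined probabilities of about .88 and .12 for the taken and not-taken edges.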

The branch b_3->{b_3, b_4} contains a back edge, and only
the loop branch heuristic (LBH) applies, so:

prob(b_3->b_3) = LBH(b_3->b_3) = 0.88

prob(b_3->b_4) = LBH(b_3->b_4) = 0.12

FIG. 10 shows the atoi flow graph labeled with each
edge's branch probabilities. For the inner loop (b_3->b_3) the
cyclic probability is the same as prob(b_3->b_3). For the outer
loop, the block frequencies and edge frequencies for the atoi
function are calculated as follows:

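$$
\begin{aligned}
\mathrm{bfreq}(b_0) &= 1 \\
\mathrm{freq}(b_0 \to b_1) &= \mathrm{prob}(b_0 \to b_1) \times 1 = 0.95 \\
\mathrm{freq}(b_0 \to b_5) &= \mathrm{prob}(b_0 \to b_5) \times 1 = 0.05 \\
\mathrm{bfreq}(b_1) &= \mathrm{freq}(b_0 \to b_1) = 0.95 \\
\mathrm{freq}(b_1 \to b_2) &= \mathrm{prob}(b_1 \to b_2) \times \mathrm{bfreq}(b_1) = 0.88 \times 0.95 = 0.84 \\
\mathrm{freq}(b_1 \to b_4) &= \mathrm{prob}(b_1 \to b_4) \times \mathrm{bfreq}(b_1) = 0.12 \times 0.95 = 0.11 \\
\mathrm{bfreq}(b_2) &= \mathrm{freq}(b_1 \to b_2) = 0.84 \\
\mathrm{freq}(b_2 \to b_3) &= \mathrm{prob}(b_2 \to b_3) \times \mathrm{bfreq}(b_2) = 1 \times 0.84 \\
\mathrm{bfreq}(b_3) &= \frac{\mathrm{freq}(b_2 \to b_3)}{1 - \mathrm{prob}(b_3 \to b_3)} = \frac{0.84}{1 - 0.88} = 6.99 \\
\mathrm{freq}(b_3 \to b_3) &= \mathrm{bfreq}(b_3) \times 0.88 = 6.15 \\
\mathrm{freq}(b_3 \to b_4) &= \mathrm{bfreq}(b_3) \times 0.12 = 0.84 \\
\mathrm{bfreq}(b_4) &= \mathrm{freq}(b_1 \to b_4) + \mathrm{freq}(b_3 \to b_4) = 0.11 + 0.84 = 0.95 \\
\mathrm{bfreq}(b_5) &= \mathrm{freq}(b_0 \to b_5) = 0.05
\end{aligned}
$$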

FIG. 11 shows the block and edge frequencies. Note that
because one starts the entry block with a frequency of one,
the exit blocks' total frequency is also one. That is,
freq(b_0->b_5) + freq(b_1->b_4) + freq(b_3->b_4) = 0.05 + 0.11 + 0.84 = 1.

The atoi function calls printf in block b_5, and block b_5's
frequency is 0.05. So, in the call graph, atoi calls printf with
a local frequency of 0.05. This process is continued to find
the local call frequencies for the permutation program,
shown in FIG. 12A.

As shown in FIG. 12A, the permutation program of Table
5 has a recursive cycle in which the permute function is the
head. The first call to propagate_call_freq updates
lfreq(permute_next_pos->permute) from 1.52 to 0.5 x 1.52 =
0.76. In the final call to propagate_call_freq, the global call
frequency and invocation frequencies for the various functions
are obtained. They are shown in FIG. 12B. The
(report_one, printf) function pair has the highest global call
frequency and may have priority in ordering the functions.

Table 8 lists local branch frequencies in its third column, 
function invocation frequencies in its fourth column, and 
global branch frequencies in its fifth column. 

TABLE 8

                                local         func.         global
                                edge          invoc.        edge
functions           edges       freq.         freq.         freq.

report_one          b0->b3      .02           2.1           .05
report_one          b0->b5      .98           2.1           2.1
report_one          b1->b3      .98           2.1           2.1
report_one          b1->b1      7.2           2.1           15.0
report_one          b5->b1      .98           2.1           2.1
in_prefix           b0->b5      .02           17.1          .41
in_prefix           b0->b7      .98           17.1          16.7
in_prefix           b1->b2      .29           17.1          4.9
in_prefix           b1->b3      [illegible]   17.1          10.0
in_prefix           b3->b5      .70           17.1          12.0
in_prefix           b3->b1      5.2           17.1          88.0
in_prefix           b7->b1      .98           17.1          16.7
permute_next_pos    b0->b5      .02           2.1           0.5
permute_next_pos    b0->b7      .98           2.1           2.1
permute_next_pos    b1->b2      1.5           2.1           23.2
permute_next_pos    b1->b3      6.6           2.1           13.5
permute_next_pos    b2->b3      1.5           2.1           3.2
permute_next_pos    b3->b5      .98           2.1           2.1
permute_next_pos    b3->b1      7.2           2.1           15.0
permute_next_pos    b7->b1      .98           2.1           2.1
atoi                b0->b1      .95           1.0           .95
atoi                b0->b5      .05           1.0           .05
atoi                b1->b4      .11           1.0           .11
atoi                b1->b2      .84           1.0           .84
atoi                b2->b3      .84           1.0           .84
atoi                b3->b3      6.2           1.0           6.2
atoi                b3->b4      .84           1.0           .84
permute             b0->b1      .50           4.2           2.1
permute             b0->b2      .50           4.2           2.1

Having illustrated and described the principles of the 
invention in a preferred embodiment, it should be apparent 
to those skilled in the art that the embodiment can be 
modified in arrangement and detail without departing from 
such principles. In view of the many possible embodiments 
to which the principles of our invention may be applied, it 
should be recognized that the illustrated embodiment is only 
a preferred example of the invention and should not be taken 
as a limitation on the scope of the invention. Rather, the 
invention is defined by the following claims. I therefore
claim as our invention all that comes within the scope and 
spirit of these claims. 

I claim: 

1. A method of compiling a computer program, the 
method comprising the following steps: 

associating probabilities with a plurality of heuristic pre- 
dictions; 

generating intermediate code from source code of the 
program; 

partitioning the intermediate code into basic blocks; 
storing the basic blocks in a basic block data structure 

including the intermediate code and a branch to other 

basic blocks; 
for a branch: 

determining which heuristic predictions apply to the
branch; and

if more than one heuristic prediction applies, combin- 
ing the associated probabilities of at least two heu- 
ristic predictions that apply to compute a probability 
of the branch being taken by the program; 

storing the branch probabilities; 

generating object code from the intermediate code; and 

storing the object code in an order based on the branch 
probabilities. 

2. The method of claim 1 wherein the step of storing the 
object code comprises: 

computing branch frequencies from the branch probabili- 
ties; and 

storing the object code in an order based on the branch 
frequencies. 

3. The method of claim 2 wherein the step of computing 
branch frequencies from branch probabilities comprises: 

determining, from the branch probabilities, frequencies at
which the basic blocks of intermediate code are
executed;

for a branch, computing a branch frequency by combining 
the frequency of the basic block from which the branch 
is taken with the branch probability. 

4. The method of claim 1 wherein functions of a program 
each comprise a group of basic blocks, the method including 
the following steps: 

computing, from the branch probabilities, frequencies for 

the basic blocks of a function f; 
combining the frequencies of basic blocks in the function 

f which include a call to a function g to obtain a local 
call frequency lfreq(f,g), the local call frequency being
the frequency with which the calling function f calls 
function g assuming one invocation of function f; 

determining a function invocation frequency cfreq(f), the 
function invocation frequency being the frequency with 
which the function f is itself called in the program; 

combining the function invocation frequency with the 
local call frequency to obtain a global call frequency 
gfreq(f,g) for each function pair f and g; and 

storing the functions of the program in an order based on 
their global call frequencies. 

5. The method of claim 1 wherein the step of storing the 
object code in an order based on the branch probabilities
comprises storing the object code such that instructions 
following a branch with a higher branch probability are 
stored close to instructions preceding the branch. 

6. The method of claim 1 wherein the steps of generating 
and storing the object code comprise: 

computing branch frequencies from the branch probabili- 
ties; 

storing the branch frequencies; 

storing the basic blocks of intermediate code in an order 

based on the branch frequencies; 
generating object code from the ordered intermediate 

code; and 
storing the object code. 

7. The method of claim 1 wherein the combining step 
comprises combining the associated probabilities of at least 
two heuristic predictions according to the following: 



$$m_1 \oplus m_2(B) = \frac{\sum_{X \cap Y = B} m_1(X)\, m_2(Y)}{\sum_{U \cap W \neq \emptyset} m_1(U)\, m_2(W)}$$

wherein m_1 and m_2 are basic probability assignments of
heuristics, A is a set of possible branching target blocks,
B is a subset of A, X and Y are subsets of A whose
intersection is B, and U and W are subsets of A with at
least one element in common.

8. The method of claim 1 wherein the branch is one of two 
possible branches and the combining step comprises com- 
bining the associated probabilities of two heuristic predic- 
tions according to the following: 

probability of branch b = u*v/(u*v + (1-u)(1-v)) where u
and v are b's taken probabilities of two heuristic
predictions that apply to the branch.

9. The method of claim 1 wherein each of the heuristic 
predictions is one of a loop branch heuristic, pointer 
heuristic, opcode heuristic, guard heuristic, loop exit 
heuristic, loop header heuristic, call heuristic, store heuristic 
and return heuristic. 

10. The method of claim 1 including determining the 
probabilities of the heuristic predictions from a run time 
measurement of a set of computer programs. 

11. The method of claim 1 in which the basic block data 
structure comprises a field containing the intermediate code, 
a field for a basic block frequency, a field for each branch to 
other basic blocks, and, for an indicated branch, a field for 
branch frequency and field for branch probability. 

12. A computer-readable medium on which is stored a 
computer program comprising instruction for executing the 
method of claim 1. 

13. A method of compiling a computer program, the 
method comprising the following steps: 

generating intermediate code from source code of the 
program; 



partitioning the intermediate code into basic blocks; 
storing the basic blocks in a basic block data structure 
including intermediate code and a branch to other basic 
blocks; 

computing for the branches a probability that a branch is 
taken; 

for a block that is a loop head, determining a block
frequency bfreq(b_0) by computing a cyclic probability
cp(b_0) for the block and computing the block frequency
according to the following:

bfreq(b_0) = in_freq(b_0)/(1 - cp(b_0)) where in_freq(b_0)
is a sum of branch frequencies into block b_0 from
non-loop branches;
for a block that is not a loop head, determining a block
frequency bfreq(b_i) by summing branch frequencies
into the block;
computing branch frequencies by combining the frequency
of a basic block from which the branch is
taken with the branch probability;
generating object code from the intermediate code; and
storing the object code in an order based on the branch
frequencies.

14. The method of claim 13 wherein functions of a
program each comprise a group of the basic blocks, the
method including the following steps:

computing block frequencies for the basic blocks of a 
function f; 

combining the frequencies of basic blocks in the function 
f which include a call to a function g to obtain a local 
call frequency lfreq(f,g), the local call frequency being 
the frequency with which the calling function f calls 
function g assuming one invocation of function f; 
determining a function invocation frequency cfreq(f), the 
function invocation frequency being the frequency with 
which the function f is itself called in the program; 
combining the function invocation frequency with the 
local call frequency for each called function to obtain 
a global call frequency gfreq(f,g) for each function pair 
f and g; and 

storing the functions of the program in an order based on 
their global call frequencies. 

15. The method of claim 14 wherein the storing step 
comprises storing the functions of the program such that 
function pairs with a higher global call frequency are stored 
within a same virtual memory page. 

16. A computer-readable medium on which is stored a 
computer program comprising instruction for executing the 
method of claim 13. 

17. An apparatus for compiling a computer program, 
comprising: 

a stored table of probabilities associated with a plurality 

of heuristic predictions; 
a compiler front end for generating intermediate code 

from source code of the program; 
a code analyzer for partitioning the intermediate code into 
basic blocks; 

a plurality of basic block data structures contained in 
memory, the data structures each including intermedi- 
ate code of a basic block and a branch to other basic 
blocks; 

a branch analyzer for determining which heuristic predic- 
tions apply to a branch and, if more than one heuristic 
prediction applies, combining the associated probabili- 
ties of at least two heuristic predictions that apply to 
compute a probability of the branch being taken by the 
program; and 
a code generator for generating object code from the
intermediate code and for storing the object code in an 
order based on the branch probabilities. 

18. The apparatus of claim 17 wherein the code generator
generates assembly code from the intermediate code and
includes an assembler for assembling the assembly code into
binary object code.

19. The apparatus of claim 17 wherein the branch analyzer
is constructed to compute branch frequencies from the
branch probabilities, the object code stored in an order based
on the computed branch frequencies.

20. The apparatus of claim 17 wherein the basic block
data structure comprises a field containing the intermediate
code, a field for a basic block frequency, a field for each
branch to other basic blocks, and, for an indicated branch, a
field for branch frequency and a field for branch probability.

21. An apparatus for compiling a computer program, 
comprising: 

a compiler front end for generating intermediate code 
from source code of the program; 

a code analyzer for partitioning the intermediate code into 
basic blocks; 

a plurality of basic block data structures, a basic block 
data structure including intermediate code of a basic 
block and a branch to other basic blocks; 

branch probabilities stored for the branches; 

a branch analyzer for computing branch frequencies from 
the branch probabilities in the following manner: 

for a block that is a loop head, determining a block 
frequency bfreq(b_0) by computing a cyclic probability 
cp(b_0) for the block and computing the block 
frequency according to the following: 

bfreq(b_0) = in_freq(b_0)/(1 - cp(b_0)) 

where in_freq(b_0) is a sum of branch frequencies into 
block b_0 from non-loop branches; 

for a block that is not a loop head, determining a block 
frequency bfreq(b_i) by summing branch frequencies 
into the block; 

computing branch frequencies by combining the frequency 
of a basic block from which the branch is taken with 
the branch probability; 

a code generator for generating object code from the 
intermediate code; and 

means for storing the object code in an order based on the 
branch frequencies. 
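
The three frequency rules of claim 21 can be sketched as small functions. Two assumptions are made: the cyclic probability is taken as an input already computed elsewhere, and "combining" a block frequency with a branch probability is read as multiplication. The function names are hypothetical.

```python
from typing import List


def loop_head_frequency(in_freq: float, cp: float) -> float:
    """bfreq(b_0) = in_freq(b_0) / (1 - cp(b_0)) for a loop head."""
    if not 0.0 <= cp < 1.0:
        raise ValueError("cyclic probability must lie in [0, 1)")
    return in_freq / (1.0 - cp)


def block_frequency(incoming: List[float]) -> float:
    """bfreq(b_i) for a block that is not a loop head: sum of the
    frequencies of the branches into the block."""
    return sum(incoming)


def branch_frequency(source_block_freq: float, probability: float) -> float:
    """Assumed reading of 'combining': block frequency times probability."""
    return source_block_freq * probability


# A loop entered once whose head is re-entered with probability 0.75.
head = loop_head_frequency(in_freq=1.0, cp=0.75)   # 4.0
back_edge = branch_frequency(head, 0.75)           # 3.0
loop_exit = branch_frequency(head, 0.25)           # 1.0
```

In this example the exit frequency equals the entry frequency, which is the conservation one would expect: every entry into the loop eventually leaves it.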

22. The apparatus of claim 21 including: 

a data structure contained in memory for storing local call 
frequencies lfreq(f,g), a local call frequency being the 
frequency with which a calling function f calls a 
function g within a single invocation of function f; 

means for determining a function invocation frequency 
cfreq(f), a function invocation frequency being the 
frequency with which a calling function f is itself called 
in the program; 

means for combining the function invocation frequency 
with the local call frequency for each called function to 
obtain a global call frequency gfreq(f,g) for each func- 
tion pair f and g; and 

means for storing the functions of the program in an order 
based on their global call frequencies. 

23. The apparatus of claim 22 wherein the data structure 
contains for functions in the program a field for the local call 
frequency, a field for the global call frequency and a field for 
the invocation frequency. 
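
Claims 22 and 23 can be illustrated with a call-edge record holding the three frequency fields. The sketch assumes that "combining" the invocation frequency with the local call frequency means multiplying them, and that ordering means sorting pairs by descending global call frequency; the names `CallEdge` and `order_by_global_frequency` and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CallEdge:
    caller: str          # function f
    callee: str          # function g
    lfreq: float         # local call frequency lfreq(f, g)
    cfreq: float         # invocation frequency cfreq(f) of the caller
    gfreq: float = 0.0   # global call frequency gfreq(f, g), filled in below


def order_by_global_frequency(edges: List[CallEdge]) -> List[CallEdge]:
    """Fill in gfreq for each pair and return the pairs, hottest first."""
    for edge in edges:
        # Assumed combination: gfreq(f, g) = cfreq(f) * lfreq(f, g)
        edge.gfreq = edge.cfreq * edge.lfreq
    return sorted(edges, key=lambda e: e.gfreq, reverse=True)


edges = [
    CallEdge(caller="main", callee="parse", lfreq=2.0, cfreq=1.0),
    CallEdge(caller="parse", callee="lex", lfreq=50.0, cfreq=2.0),
]
ordered = order_by_global_frequency(edges)
```

A layout pass could then walk `ordered` and place each pair on the same page while space remains, so that the most frequent caller and callee pairs end up together.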

24. A computer-implemented method of predicting branch 
probability in the compiling of a computer program, the 
method comprising the following steps: 

associating probabilities with a plurality of heuristic pre- 
dictions; 

identifying a branch in a program; 

for a branch: 

determining which heuristic predictions apply to the 
branch; and 

if more than one heuristic prediction applies, combin- 
ing the associated probabilities of at least two heu- 
ristic predictions that apply to compute a probability 
of the branch being taken by the program. 

25. The method of claim 24 wherein the associated 
probability for a heuristic prediction is derived from empiri- 
cal data. 

26. A computer-implemented method of determining 
branch frequencies for basic blocks in the compiling of a 
computer program, the method comprising the following 
steps: 

identifying branches in a computer program; 

computing for the branches a probability that a branch is 
taken; 

for a block that is a loop head, determining a block 
frequency bfreq(b_0) by computing a cyclic probability 
cp(b_0) for the block and computing the block frequency 
according to the following: 

bfreq(b_0) = in_freq(b_0)/(1 - cp(b_0)) 

where in_freq(b_0) is a sum of branch frequencies into 
block b_0 from non-loop branches; 

for a block that is not a loop head, determining a block 
frequency bfreq(b_i) by summing branch frequencies 
into the block; and 

computing branch frequencies by combining the fre- 
quency of a basic block from which the branch is taken 
with the branch probability. 
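
The loop-head formula can be motivated by a geometric series. This is an interpretive sketch, assuming that cp(b_0) is the probability that control, once at the loop head, returns to it through the loop, and that successive returns are independent. The head then executes once per entry from outside, again with probability cp(b_0), again with probability cp(b_0)^2, and so on:

```latex
\begin{align*}
\mathit{bfreq}(b_0)
  &= \mathit{in\_freq}(b_0) \sum_{k=0}^{\infty} \mathit{cp}(b_0)^{k} \\
  &= \frac{\mathit{in\_freq}(b_0)}{1 - \mathit{cp}(b_0)},
  \qquad 0 \le \mathit{cp}(b_0) < 1 .
\end{align*}
```

For example, with in_freq(b_0) = 1 and cp(b_0) = 0.75 the series sums to 4, so the loop head is estimated to execute four times per entry into the loop.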

***** 